Figure: Data Science Life Cycle



Business Problem Understanding

Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests?

The aim is to build meaningful estimators from the available dataset and to select the model that predicts cancellation best, by comparing the accuracy scores of different ML models.

Dataset

This dataset contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

From the publication sciencedirect we know that:

Import all needed libraries

Data Cleaning and Preparation

Checking null values

Dealing with nulls in company column

About 94.31% of the values in the company column are missing. That leaves too few values to fill the rest by prediction, by the mean, or by a similar method, so the best option is to drop the company column.
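The check behind this decision can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the notebook's original cell: the toy values and the 50% cutoff are assumptions chosen only to show the pattern.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings DataFrame; the values are illustrative only.
df = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [9.0, np.nan, 240.0, 9.0],
    "lead_time": [10, 45, 3, 120],
})

# Percentage of missing values per column, largest first.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)

# Drop columns whose missing share exceeds a cutoff (50% is an assumed value).
threshold = 50.0
to_drop = missing_pct[missing_pct > threshold].index.tolist()
df_clean = df.drop(columns=to_drop)

print(missing_pct)
print("Dropped columns:", to_drop)
```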

Dealing with nulls in agent column

About 13.69% of the values in the agent column are missing, so there is no need to drop the column outright. We should not drop the affected rows either: 13.69% of the data is a large share, and those rows may carry crucial information. There are 334 unique agents, which may be too many for the missing values to be predicted reliably.

I will decide what to do about agent after correlation section.

Dealing with nulls in children column

The children column also has 4 missing values. We assume that a booking with no information about children has no children.

Fill the nulls with 0.
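A minimal sketch of this fill, using a toy DataFrame in place of the real bookings data. Casting to integer afterwards is an optional extra step, not something the text above requires.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings DataFrame; the values are illustrative only.
df = pd.DataFrame({"adults": [2, 1, 2], "children": [0.0, np.nan, 1.0]})

# Treat a missing children count as "no children".
df["children"] = df["children"].fillna(0).astype(int)

print(df)
```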

Dealing with nulls in country column

Only 0.41% of the values in the country column are missing, so we can simply drop those rows.
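Dropping only the rows where country is missing could look like the sketch below. The toy DataFrame and the adr column are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the bookings DataFrame; the values are illustrative only.
df = pd.DataFrame({
    "country": ["PRT", None, "GBR"],
    "adr": [75.0, 80.0, 98.0],
})

rows_before = len(df)

# Remove rows with a missing country and keep every other row untouched.
df = df.dropna(subset=["country"]).reset_index(drop=True)

print(f"Dropped {rows_before - len(df)} of {rows_before} rows")
```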

Exploratory Data Analysis (EDA)

Which months are the busiest?

How long do people stay at the hotels?

Most guests do not seem to stay at either hotel for more than one week, although stays of up to 15 days appear normal at the Resort Hotel.

How many bookings were canceled?

For the City Hotel, the cancellation rate decreases as the number of days on the waiting list increases.

For the Resort Hotel, the cancellation rate is very low.

For both hotels, the number of cancellations is very large when the number of waiting days is 0.

Feature Engineering

Adding new features

Add the following features to the dataset: is_family, total_customer, deposit_given, total_nights.
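The text does not spell out how these four features are defined, so the sketch below uses assumed definitions that match their names. The column stays_in_week_nights, the deposit_type labels, and the choice to count a refundable deposit as a deposit are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the bookings DataFrame; the values are illustrative only.
df = pd.DataFrame({
    "adults": [2, 1, 2],
    "children": [1, 0, 0],
    "babies": [0, 0, 0],
    "deposit_type": ["No Deposit", "Non Refund", "Refundable"],
    "stays_in_weekend_nights": [2, 0, 1],
    "stays_in_week_nights": [3, 2, 0],  # assumed column name
})

# Assumed definition: at least one adult travelling with a child or a baby.
df["is_family"] = (
    (df["adults"] > 0) & ((df["children"] + df["babies"]) > 0)
).astype(int)

# Assumed definition: everyone on the booking.
df["total_customer"] = df["adults"] + df["children"] + df["babies"]

# Assumed definition: any deposit type other than "No Deposit".
df["deposit_given"] = (df["deposit_type"] != "No Deposit").astype(int)

# Assumed definition: weekend nights plus week nights.
df["total_nights"] = df["stays_in_weekend_nights"] + df["stays_in_week_nights"]

print(df[["is_family", "total_customer", "deposit_given", "total_nights"]])
```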

Drop useless features

The new features are more expressive than these columns, so I'll drop the following:

['adults', 'babies', 'children', 'deposit_type', 'reservation_status_date']

Handling the categorical columns

Correlation checking

Features with high correlation

* reservation_status               -0.917196
* deposit_given                    -0.481457
* total_of_special_requests        -0.234658

reservation_status appears to be the most impactful feature. With that information, the accuracy should be very high.

Features with low correlation

* arrival_date_day_of_month        -0.006130
* stays_in_weekend_nights          -0.001791
* arrival_date_week_number          0.008148
* arrival_date_year                 0.016339
* agent                            -0.130010

Returning to the agent column, which still has missing values: its correlation with cancellation (-0.130010) is meaningful, but since about 13% of its values are missing, it is better to drop the column.
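A ranking like the ones listed above can be produced by correlating every numeric column with the target. The sketch below assumes the target column is named is_canceled and uses toy values; the 0.05 cutoff for "low correlation" is also an assumption.

```python
import pandas as pd

# Toy stand-in for the encoded bookings DataFrame; the values are illustrative only.
df = pd.DataFrame({
    "is_canceled": [1, 0, 1, 0, 1, 0],  # assumed target column name
    "lead_time": [200, 10, 150, 5, 90, 30],
    "total_of_special_requests": [0, 2, 0, 1, 0, 3],
    "arrival_date_day_of_month": [3, 3, 17, 17, 28, 28],
})

# Pearson correlation of each feature with the target.
corr = df.corr()["is_canceled"].drop("is_canceled")

# Order by absolute strength, keeping the sign for interpretation.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

# Features whose absolute correlation falls below an assumed cutoff.
weak = ranked[ranked.abs() < 0.05].index.tolist()

print(ranked)
print("Weakly correlated:", weak)
```

Pearson correlation only captures linear relationships, so a low value is a hint rather than proof that a feature is uninformative.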

Created features correlation

* deposit_given                    -0.481457
* is_family                        -0.013010
* total_nights                      0.017779
* total_customer                    0.046522

I will drop total_nights and is_family, as they have low correlation with cancellation.

Drop features with low correlation

['total_nights', 'is_family', 'arrival_date_week_number', 'stays_in_weekend_nights', 'arrival_date_month', 'agent']

Normalize dataset

Data Modeling

Splitting dataset to train and test sets

The reservation_status feature is very highly correlated with cancellation; including it lets the model reach almost 99% accuracy.

So I built my models without this column and compared the different models.

Re-splitting without the reservation_status column
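The sketch below illustrates the idea on synthetic data: fit the same logistic regression on a hold-out split once with reservation_status and once without it. The data, the is_canceled target name, the encoding of reservation_status, and the split settings are all assumptions for illustration, not the notebook's actual values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared bookings data.
rng = np.random.default_rng(0)
n = 400
is_canceled = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "lead_time": rng.normal(50, 20, size=n) + 15 * is_canceled,
    "total_of_special_requests": rng.integers(0, 4, size=n),
    # Assumed encoding that mirrors the target, which is what makes it leak.
    "reservation_status": 1 - is_canceled,
    "is_canceled": is_canceled,
})


def holdout_accuracy(features):
    """Fit logistic regression on a stratified split and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features],
        df["is_canceled"],
        test_size=0.25,
        random_state=42,
        stratify=df["is_canceled"],
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_test, y_test)


all_features = [c for c in df.columns if c != "is_canceled"]
acc_with = holdout_accuracy(all_features)
acc_without = holdout_accuracy(
    [c for c in all_features if c != "reservation_status"]
)

print(f"with reservation_status:    {acc_with:.3f}")
print(f"without reservation_status: {acc_without:.3f}")
```

Because reservation_status is only known after the outcome of a booking, the score without it is the more realistic estimate of how well cancellations can be predicted in advance.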

Logistic Regression (with reservation_status feature)

Logistic Regression (without reservation_status feature)

Decision Tree Classifier

XGBoost Classifier

Random Forest Classifier using Grid Search Cross Validation
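A minimal sketch of a grid search over a random forest with scikit-learn's GridSearchCV. It runs on synthetic data, and the parameter grid, the number of folds, and the split settings are assumptions rather than the values used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the prepared bookings.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Illustrative grid; the candidate values are assumptions, not tuned settings.
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}

# 3-fold cross-validation on the training set for every grid combination.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# The refitted best estimator is evaluated once on the untouched test set.
test_accuracy = search.score(X_test, y_test)

print("Best parameters:", best_params)
print(f"Test accuracy: {test_accuracy:.3f}")
```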

Conclusion with Visualization